1. The Data


2. EDA

2.1 First look at the data

fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5
7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5
11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6
7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.4 0.66 0.00 1.8 0.075 13 40 0.9978 3.51 0.56 9.4 5

2.2 Tidy up the columns names

Now the names are seprated with (_) instead of a space. and be in lower case.

2.3 Summary statistics

Skim is a very nice package for summary statistics Skim.

skim(wine_clean)
Data summary
Name wine_clean
Number of rows 1599
Number of columns 12
_______________________
Column type frequency:
numeric 12
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
fixed_acidity 0 1 8.32 1.74 4.60 7.10 7.90 9.20 15.90 ▂▇▂▁▁
volatile_acidity 0 1 0.53 0.18 0.12 0.39 0.52 0.64 1.58 ▅▇▂▁▁
citric_acid 0 1 0.27 0.19 0.00 0.09 0.26 0.42 1.00 ▇▆▅▁▁
residual_sugar 0 1 2.54 1.41 0.90 1.90 2.20 2.60 15.50 ▇▁▁▁▁
chlorides 0 1 0.09 0.05 0.01 0.07 0.08 0.09 0.61 ▇▁▁▁▁
free_sulfur_dioxide 0 1 15.87 10.46 1.00 7.00 14.00 21.00 72.00 ▇▅▁▁▁
total_sulfur_dioxide 0 1 46.47 32.90 6.00 22.00 38.00 62.00 289.00 ▇▂▁▁▁
density 0 1 1.00 0.00 0.99 1.00 1.00 1.00 1.00 ▁▃▇▂▁
p_h 0 1 3.31 0.15 2.74 3.21 3.31 3.40 4.01 ▁▅▇▂▁
sulphates 0 1 0.66 0.17 0.33 0.55 0.62 0.73 2.00 ▇▅▁▁▁
alcohol 0 1 10.42 1.07 8.40 9.50 10.20 11.10 14.90 ▇▇▃▁▁
quality 0 1 5.64 0.81 3.00 5.00 6.00 6.00 8.00 ▁▇▇▂▁

We can see the following things:

  1. all columns are numeric
  2. no missing values
  3. it is not a big dataset

2.4 Univariate

2.4.1 Quality Distribution

It appears that is this sample dataset, the most frequent wine quality are 5 and 6.

2.4.2 PH Distribution (boxplot)

2.4.3 All data Distribution

2.4.2 Alcohol Distribution (boxplot)

2.5 Bivariate

2.5.1 GGPair graph (group by quality)

wine_ggpair <- wine_clean
wine_ggpair$quality <- as.factor(wine_ggpair$quality)
ggpairs(wine_ggpair,columns = 1:6,aes(colour = quality, alpha = 0.4), upper=list(continuous = "blank"))

wine_ggpair2 <- wine_clean
wine_ggpair2$quality <- as.factor(wine_ggpair2$quality)
ggpairs(wine_ggpair2,columns = 7:11,aes(colour = quality, alpha = 0.4), upper=list(continuous = "blank"))

2.5.2 Correlation (heatmap)

  • Alcohol and Quality are inversely highly correlated.
  • fixed acidity is highly correlated with Ensity and Citric acid
  • total sulfur dioxide is highly correlated with free sulfur dioxide
## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

2.6 Mean values grouped by the quality

wine_df <- wine_clean %>% group_by(quality) %>% summarise(across(everything(), mean))
wine_df <- signif(wine_df, digits = 3)
reactable(wine_df)

2.7 2D interractive graph

plot <- ggplot(data=wine_clean, aes(alcohol,fixed_acidity)) +
geom_point(aes(color=quality)) +
ggtitle("2D interractive graph")

ggplotly(plot)

3. Conclusion

In this report i have done some EDA for the Red Wine Quality from Kaggle. And it shows that some of the features like Alcohol and fixed_acidity may be good to build a prediction models the quality of the wine. And based on the heat map there were some inverse correlation between the values.